3.2 - Working with text (Chinese)

Here we adapt the examples from the previous exercise to Chinese text. The major difference is that Chinese is typically written without spaces between words, so we cannot split it into words in a naive way, for example by splitting on whitespace (we will address this with a special library in exercise 4.2). However, we can still make some progress by breaking the text up into individual characters and finding the most common ones. Chinese text also uses a different set of punctuation and special characters, which we need to account for.
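
To make the difference concrete, here is a small sketch using a made-up sample sentence (it is not taken from the data file). Splitting on whitespace should return the whole sentence as a single token, while converting the string to a list gives one item per character.

```python
# Hypothetical sample sentence, used only for illustration.
sample = "我喜歡讀書"

# Whitespace splitting finds no word boundaries, so the sentence stays in one piece.
words = sample.split()
print("tokens from split():", words)

# A Python str is a sequence of Unicode characters, so list() splits it per character.
chars = list(sample)
print("characters:", chars)
```

This is why the rest of the exercise counts characters rather than words.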


In [ ]:
import re

In [ ]:
filename = "data/journey.txt"
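# Assumes the file is UTF-8 encoded; the with-block closes the file after reading.
with open(filename, encoding="utf-8") as f:
    raw_text = f.read()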

n_chars = len(raw_text)
print("length of text:", n_chars)

In [ ]:
paragraphs = raw_text.split('\n')
paragraphs = [p for p in paragraphs if len(p) > 0]
print("number of paragraphs:", len(paragraphs))

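# Characters to drop before counting: punctuation and spaces. Inside [...] each
# character stands for itself, so no separating commas are needed. Full-width
# , and : are added next to the ASCII forms (an assumption about the text).
# 一 (the word "one") was in the original list, probably mistaken for a dash;
# it is a real character, so it is no longer stripped.
punctuation = re.compile('[,,。::﹔《》」「 ]')

characters = []
for p in paragraphs:
    characters += punctuation.sub('', p)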
print("number of characters:", len(characters))
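
Listing punctuation by hand makes it easy to miss a symbol. As an alternative sketch (a different technique from the regular expression used in this exercise), we could filter on Unicode categories with the standard `unicodedata` module: categories starting with `P` are punctuation and those starting with `Z` are separators such as spaces. The helper name `strip_punctuation` and the sample sentence are made up for illustration.

```python
import unicodedata

def strip_punctuation(text):
    """Return the characters of text, skipping punctuation (P*) and separators (Z*)."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("P") or category.startswith("Z"):
            continue
        kept.append(ch)
    return kept

# Hypothetical sample sentence, not taken from the data file.
sample = "悟空道:「師父,走吧。」"
print(strip_punctuation(sample))
```

Note that this keeps 一, since it is an ordinary character rather than punctuation. Newlines are control characters and are not removed by this filter, but the text has already been split into paragraphs at that point.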

In [ ]:
# https://docs.python.org/3/library/collections.html#counter-objects
import collections

# Counter maps each character to the number of times it occurs.
char_counts = collections.Counter(characters)
common = char_counts.most_common(10)
print(common[0])
print("most common characters:", [c[0] for c in common])

unique_chars = list(char_counts)
print("unique characters:", len(unique_chars))
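
Raw counts are easier to interpret as a share of the whole text. The sketch below shows one way to turn a `Counter` into relative frequencies and to measure how much of the text the top few characters cover. It uses a short made-up list, `sample_chars`, as a stand-in for the `characters` list built above, and `top_n` is an arbitrary illustrative choice.

```python
import collections

# Hypothetical stand-in for the `characters` list built earlier.
sample_chars = list("天地玄黃宇宙洪荒天地人天")

counts = collections.Counter(sample_chars)
total = sum(counts.values())

# Relative frequency of each of the most common characters.
for char, count in counts.most_common(3):
    print(char, count, f"{count / total:.1%}")

# Share of the sample covered by the top_n most common characters.
top_n = 2
coverage = sum(count for _, count in counts.most_common(top_n)) / total
print(f"top {top_n} characters cover {coverage:.1%} of the sample")
```

Running the same calculation on the full `characters` list would show how quickly a small number of frequent characters accounts for most of the text.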